You may add as many code and markdown cells as you see fit to answer the questions.
You will demonstrate your ability to merge, group, summarize, visualize, and find patterns in data. This exam uses data associated with a manufacturing example. An overview of the goals, considerations, CSV files, and variables within the data is provided in a presentation on Canvas. Please read through those slides before starting the exam.
The data are provided in 5 separate CSV files. The CSV files are available on Canvas. You MUST download the files and save them to the same working directory as this notebook.
The specific instructions in this notebook tell you when you must JOIN the data together. Please read the problems carefully.
The overall objective of this exam is to JOIN data from multiple files in order to explore and find interesting patterns between the machine operating conditions and supplier information. You will report your findings within this notebook by displaying Pandas DataFrames and statistical visualizations via Seaborn and matplotlib when necessary.
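The join workflow described above can be sketched with toy frames (a minimal sketch only; the `Batch` key and the `Supplier` column name are assumptions based on the files described in this exam):

```python
import pandas as pd

# Toy stand-ins for observation-level machine data and batch-level supplier data
machine = pd.DataFrame({'Batch': [1, 1, 2], 'x1': [50.1, 46.9, 50.1]})
batch_info = pd.DataFrame({'Batch': [1, 2], 'Supplier': ['A', 'B']})

# A left merge keeps every machine observation and attaches its batch-level info
merged = machine.merge(batch_info, on='Batch', how='left')
print(merged['Supplier'].tolist())  # ['A', 'A', 'B']
```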
You are permitted to use the following modules on this exam.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
You may also use the following functions from scikit-learn on this exam.
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
You may also use the following submodule from SciPy.
from scipy.cluster import hierarchy
You are NOT permitted to use any other modules or functions. However, you ARE permitted to create your own user-defined functions if you would like.
The file names for the 3 machine data sets are provided as strings in the cell below. You are required to read in the CSV files and assign the data to the m01_df, m02_df, and m03_df objects. The data from machine 1 will therefore be associated with m01_df, machine 2 is associated with m02_df, and machine 3 is associated with m03_df.
In this problem you must explore each of the three machine data sets.
You must perform the following ESSENTIAL activities:
You must visually explore the MARGINAL behavior of the variables in the data. You must use visualizations appropriate for the DATA TYPE of the columns.
You must visually explore RELATIONSHIPS between variables in the data. You must use visualizations appropriate for the DATA TYPES. You must make sure that your visualizations can answer the following questions:
- How many unique Batch values are associated with each MACHINE (data set)?
- What is the most common Batch value for each MACHINE (data set)?
- How do the marginal distributions of x1 through x4 vary across the three MACHINES?
- How do the marginal distributions of x1 through x4 vary across the Batch values?
- How do the relationships between x1 through x4 vary across the three MACHINES?
- How do the relationships between x1 through x4 vary across the Batch values?

At the conclusion of this problem, you MUST CONCATENATE the 3 MACHINE data sets into a single DataFrame. The single DataFrame must be named machine_df. Before concatenating, you MUST add a column machine_id to each DataFrame with the correct index value for that machine (1, 2, or 3). The concatenated DataFrame variable name is provided as a reminder to you below.
You may add as many markdown and code cells as you see fit to answer this question. Include markdown cells stating what you see in the figures and why you selected them.
# Define the file names for the 3 machine-level CSV files
file_m01 = 'midterm_machine_01.csv'
file_m02 = 'midterm_machine_02.csv'
file_m03 = 'midterm_machine_03.csv'
# read in the CSV files and name them accordingly
m01_df = pd.read_csv(file_m01)
m02_df = pd.read_csv(file_m02)
m03_df = pd.read_csv(file_m03)
m01_df Essentials
m01_df.head()
| ID | Batch | s_id | x1 | x2 | x3 | x4 | |
|---|---|---|---|---|---|---|---|
| 0 | B001-M01-S001 | 1 | 1 | 50.117118 | 102.167346 | 22.067812 | 13.889524 |
| 1 | B001-M01-S002 | 1 | 2 | 46.887333 | 112.266102 | 23.197330 | 13.064709 |
| 2 | B001-M01-S003 | 1 | 3 | 50.132744 | 103.674908 | 22.319933 | 13.647482 |
| 3 | B001-M01-S004 | 1 | 4 | 48.501042 | 107.143156 | 22.162947 | 14.077758 |
| 4 | B001-M01-S005 | 1 | 5 | 49.690442 | 102.120283 | 22.248696 | 13.728666 |
m01_df.shape
(5152, 7)
m01_df has 5152 rows and 7 columns.
m01_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5152 entries, 0 to 5151 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 5152 non-null object 1 Batch 5152 non-null int64 2 s_id 5152 non-null int64 3 x1 5152 non-null float64 4 x2 5152 non-null float64 5 x3 5152 non-null float64 6 x4 5152 non-null float64 dtypes: float64(4), int64(2), object(1) memory usage: 281.9+ KB
x1, x2, x3, and x4 are float columns; Batch and s_id are integers; ID is an object (string) column. Batch is an identifier rather than a measurement, so it is converted to the object type:
m01_df['Batch'] = m01_df['Batch'].astype('object')
m01_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5152 entries, 0 to 5151 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 5152 non-null object 1 Batch 5152 non-null object 2 s_id 5152 non-null int64 3 x1 5152 non-null float64 4 x2 5152 non-null float64 5 x3 5152 non-null float64 6 x4 5152 non-null float64 dtypes: float64(4), int64(1), object(2) memory usage: 281.9+ KB
m01_df.nunique()
ID 5152 Batch 50 s_id 149 x1 5152 x2 5152 x3 5152 x4 5152 dtype: int64
ID, x1, x2, x3, and x4 have 5152 unique values each; Batch has 50 unique values; s_id has 149 unique values.
m01_df.isna().sum()
ID 0 Batch 0 s_id 0 x1 0 x2 0 x3 0 x4 0 dtype: int64
m01_df Marginal Behavior
m01_df.describe()
| s_id | x1 | x2 | x3 | x4 | |
|---|---|---|---|---|---|
| count | 5152.000000 | 5152.000000 | 5152.000000 | 5152.000000 | 5152.000000 |
| mean | 54.540373 | 51.998218 | 102.207011 | 23.947216 | 10.628436 |
| std | 33.650665 | 3.151912 | 12.888171 | 2.269325 | 1.542291 |
| min | 1.000000 | 44.275379 | 65.810729 | 19.095832 | 7.644190 |
| 25% | 26.000000 | 49.335776 | 96.610288 | 22.334048 | 9.479969 |
| 50% | 52.000000 | 52.536206 | 104.487824 | 23.855690 | 10.356002 |
| 75% | 80.000000 | 54.662490 | 110.609442 | 25.448356 | 11.318120 |
| max | 149.000000 | 58.647186 | 130.228828 | 29.477213 | 14.612911 |
m01_df.describe(include=object)
| ID | Batch | |
|---|---|---|
| count | 5152 | 5152 |
| unique | 5152 | 50 |
| top | B001-M01-S001 | 36 |
| freq | 1 | 149 |
sns.catplot(data = m01_df, x='ID', kind='count', aspect=3)
plt.show()
sns.catplot(data = m01_df, x='Batch', kind='count', aspect=3)
plt.show()
There are 50 unique Batch values associated with Machine 1.
m01_df Relationships
m01_df_features = m01_df.select_dtypes('number').copy()
m01_df_features = m01_df_features.drop(columns=['s_id'])
m01_df_features
| x1 | x2 | x3 | x4 | |
|---|---|---|---|---|
| 0 | 50.117118 | 102.167346 | 22.067812 | 13.889524 |
| 1 | 46.887333 | 112.266102 | 23.197330 | 13.064709 |
| 2 | 50.132744 | 103.674908 | 22.319933 | 13.647482 |
| 3 | 48.501042 | 107.143156 | 22.162947 | 14.077758 |
| 4 | 49.690442 | 102.120283 | 22.248696 | 13.728666 |
| ... | ... | ... | ... | ... |
| 5147 | 51.989030 | 101.587723 | 19.945482 | 11.456779 |
| 5148 | 52.331889 | 105.846820 | 20.836308 | 11.048374 |
| 5149 | 53.910224 | 99.050877 | 21.227331 | 10.697194 |
| 5150 | 51.021515 | 107.294368 | 20.914290 | 11.052844 |
| 5151 | 52.903667 | 98.787587 | 20.689594 | 11.190619 |
5152 rows × 4 columns
m01_lf = m01_df_features.reset_index().\
rename(columns={'index': 'rowid'}).\
melt(id_vars=['rowid'])
m01_lf
| rowid | variable | value | |
|---|---|---|---|
| 0 | 0 | x1 | 50.117118 |
| 1 | 1 | x1 | 46.887333 |
| 2 | 2 | x1 | 50.132744 |
| 3 | 3 | x1 | 48.501042 |
| 4 | 4 | x1 | 49.690442 |
| ... | ... | ... | ... |
| 20603 | 5147 | x4 | 11.456779 |
| 20604 | 5148 | x4 | 11.048374 |
| 20605 | 5149 | x4 | 10.697194 |
| 20606 | 5150 | x4 | 11.052844 |
| 20607 | 5151 | x4 | 11.190619 |
20608 rows × 3 columns
m01_objects = m01_df.select_dtypes('object').copy()
m01_objects
| ID | Batch | |
|---|---|---|
| 0 | B001-M01-S001 | 1 |
| 1 | B001-M01-S002 | 1 |
| 2 | B001-M01-S003 | 1 |
| 3 | B001-M01-S004 | 1 |
| 4 | B001-M01-S005 | 1 |
| ... | ... | ... |
| 5147 | B050-M01-S105 | 50 |
| 5148 | B050-M01-S106 | 50 |
| 5149 | B050-M01-S107 | 50 |
| 5150 | B050-M01-S108 | 50 |
| 5151 | B050-M01-S109 | 50 |
5152 rows × 2 columns
id_cols = ['rowid'] + m01_objects.columns.to_list()
id_cols
['rowid', 'ID', 'Batch']
m01_objects_lf = m01_df.reset_index().\
rename(columns={'index': 'rowid'}).\
melt(id_vars=id_cols, value_vars=m01_df_features.columns)
m01_objects_lf
| rowid | ID | Batch | variable | value | |
|---|---|---|---|---|---|
| 0 | 0 | B001-M01-S001 | 1 | x1 | 50.117118 |
| 1 | 1 | B001-M01-S002 | 1 | x1 | 46.887333 |
| 2 | 2 | B001-M01-S003 | 1 | x1 | 50.132744 |
| 3 | 3 | B001-M01-S004 | 1 | x1 | 48.501042 |
| 4 | 4 | B001-M01-S005 | 1 | x1 | 49.690442 |
| ... | ... | ... | ... | ... | ... |
| 20603 | 5147 | B050-M01-S105 | 50 | x4 | 11.456779 |
| 20604 | 5148 | B050-M01-S106 | 50 | x4 | 11.048374 |
| 20605 | 5149 | B050-M01-S107 | 50 | x4 | 10.697194 |
| 20606 | 5150 | B050-M01-S108 | 50 | x4 | 11.052844 |
| 20607 | 5151 | B050-M01-S109 | 50 | x4 | 11.190619 |
20608 rows × 5 columns
sns.displot(data = m01_lf, x='value', col='variable', kind='hist', kde=True,
facet_kws={'sharex': False, 'sharey': False},
common_bins=False, col_wrap=2)
plt.show()
sns.displot(data = m01_df, x='s_id', kind='hist', kde=True)
plt.show()
sns.catplot(data = m01_objects_lf, x='Batch', y='value', col='variable', col_wrap=1, aspect=3,
kind='box',
sharey=False)
plt.show()
m01_df_features
| x1 | x2 | x3 | x4 | |
|---|---|---|---|---|
| 0 | 50.117118 | 102.167346 | 22.067812 | 13.889524 |
| 1 | 46.887333 | 112.266102 | 23.197330 | 13.064709 |
| 2 | 50.132744 | 103.674908 | 22.319933 | 13.647482 |
| 3 | 48.501042 | 107.143156 | 22.162947 | 14.077758 |
| 4 | 49.690442 | 102.120283 | 22.248696 | 13.728666 |
| ... | ... | ... | ... | ... |
| 5147 | 51.989030 | 101.587723 | 19.945482 | 11.456779 |
| 5148 | 52.331889 | 105.846820 | 20.836308 | 11.048374 |
| 5149 | 53.910224 | 99.050877 | 21.227331 | 10.697194 |
| 5150 | 51.021515 | 107.294368 | 20.914290 | 11.052844 |
| 5151 | 52.903667 | 98.787587 | 20.689594 | 11.190619 |
5152 rows × 4 columns
m01_df_features_batch = m01_df_features.copy()
m01_df_features_batch['Batch'] = m01_df['Batch']
m01_df_features_batch
| x1 | x2 | x3 | x4 | Batch | |
|---|---|---|---|---|---|
| 0 | 50.117118 | 102.167346 | 22.067812 | 13.889524 | 1 |
| 1 | 46.887333 | 112.266102 | 23.197330 | 13.064709 | 1 |
| 2 | 50.132744 | 103.674908 | 22.319933 | 13.647482 | 1 |
| 3 | 48.501042 | 107.143156 | 22.162947 | 14.077758 | 1 |
| 4 | 49.690442 | 102.120283 | 22.248696 | 13.728666 | 1 |
| ... | ... | ... | ... | ... | ... |
| 5147 | 51.989030 | 101.587723 | 19.945482 | 11.456779 | 50 |
| 5148 | 52.331889 | 105.846820 | 20.836308 | 11.048374 | 50 |
| 5149 | 53.910224 | 99.050877 | 21.227331 | 10.697194 | 50 |
| 5150 | 51.021515 | 107.294368 | 20.914290 | 11.052844 | 50 |
| 5151 | 52.903667 | 98.787587 | 20.689594 | 11.190619 | 50 |
5152 rows × 5 columns
sns.pairplot(data = m01_df_features_batch, hue='Batch',
diag_kws={'common_norm': False}, palette='viridis')
plt.show()
corr_per_group_all = m01_df_features_batch.groupby('Batch').corr(numeric_only=True)
corr_per_group_all
| x1 | x2 | x3 | x4 | ||
|---|---|---|---|---|---|
| Batch | |||||
| 1 | x1 | 1.000000 | -0.799209 | -0.120857 | -0.022968 |
| x2 | -0.799209 | 1.000000 | 0.157464 | 0.000275 | |
| x3 | -0.120857 | 0.157464 | 1.000000 | -0.767872 | |
| x4 | -0.022968 | 0.000275 | -0.767872 | 1.000000 | |
| 2 | x1 | 1.000000 | -0.768966 | 0.069225 | -0.076870 |
| ... | ... | ... | ... | ... | ... |
| 49 | x4 | 0.159566 | -0.257262 | -0.808746 | 1.000000 |
| 50 | x1 | 1.000000 | -0.717987 | -0.108621 | 0.091202 |
| x2 | -0.717987 | 1.000000 | 0.076936 | -0.070050 | |
| x3 | -0.108621 | 0.076936 | 1.000000 | -0.719884 | |
| x4 | 0.091202 | -0.070050 | -0.719884 | 1.000000 |
200 rows × 4 columns
the_groups = m01_df_features_batch.Batch.unique().tolist()
print(the_groups)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50]
sublists = [the_groups[i:i + 5] for i in range(0, len(the_groups), 5)]
for i in sublists:
fig, axs = plt.subplots(1, len(i), figsize=(18, 8), sharex=True, sharey=True )
for ix in range(len(i)):
sns.heatmap( data = corr_per_group_all.loc[ i[ ix ] ],
vmin=-1, vmax=1, center = 0,
cmap='coolwarm', cbar=False,
annot=True, annot_kws={'size': 10},
ax=axs[ix] )
axs[ ix ].set_title('Batch: %s' % i[ ix ] )
plt.show()
sns.pairplot(data = m01_df)
plt.show()
m02_df Overview
m02_df.shape
(5119, 7)
m02_df has 5119 rows and 7 columns.
m02_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5119 entries, 0 to 5118 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 5119 non-null object 1 Batch 5119 non-null int64 2 s_id 5119 non-null int64 3 x1 5119 non-null float64 4 x2 5119 non-null float64 5 x3 5119 non-null float64 6 x4 5119 non-null float64 dtypes: float64(4), int64(2), object(1) memory usage: 280.1+ KB
x1, x2, x3, and x4 are float columns; Batch and s_id are integers; ID is an object (string) column. As with Machine 1, Batch is converted to the object type:
m02_df['Batch'] = m02_df['Batch'].astype('object')
m02_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5119 entries, 0 to 5118 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 5119 non-null object 1 Batch 5119 non-null object 2 s_id 5119 non-null int64 3 x1 5119 non-null float64 4 x2 5119 non-null float64 5 x3 5119 non-null float64 6 x4 5119 non-null float64 dtypes: float64(4), int64(1), object(2) memory usage: 280.1+ KB
m02_df.nunique()
ID 5119 Batch 50 s_id 133 x1 5119 x2 5119 x3 5119 x4 5119 dtype: int64
ID, x1, x2, x3, and x4 have 5119 unique values each; Batch has 50 unique values; s_id has 133 unique values.
m02_df.isna().sum()
ID 0 Batch 0 s_id 0 x1 0 x2 0 x3 0 x4 0 dtype: int64
m02_df Marginal Behavior
sns.catplot(data = m02_df, x='Batch', kind='count', aspect=3)
plt.show()
sns.catplot(data = m02_df, x='ID', kind='count', aspect=3)
plt.show()
There are 50 unique Batch values associated with Machine 2.
m02_df Relationships
m02_df.describe()
| s_id | x1 | x2 | x3 | x4 | |
|---|---|---|---|---|---|
| count | 5119.000000 | 5119.000000 | 5119.000000 | 5119.000000 | 5119.000000 |
| mean | 52.754444 | 51.989377 | 101.998905 | 23.982312 | 10.605275 |
| std | 31.263876 | 3.095129 | 12.834348 | 2.275809 | 1.523140 |
| min | 1.000000 | 44.896759 | 64.774007 | 19.232164 | 7.934064 |
| 25% | 26.000000 | 49.240567 | 96.325502 | 22.372944 | 9.492226 |
| 50% | 52.000000 | 52.610949 | 104.357308 | 23.894465 | 10.329911 |
| 75% | 78.000000 | 54.785418 | 110.672960 | 25.546656 | 11.286062 |
| max | 133.000000 | 58.503441 | 131.009046 | 29.093229 | 14.400437 |
m02_df.describe(include=object)
| ID | Batch | |
|---|---|---|
| count | 5119 | 5119 |
| unique | 5119 | 50 |
| top | B001-M02-S001 | 43 |
| freq | 1 | 133 |
m02_df_features = m02_df.select_dtypes('number').copy()
m02_df_features = m02_df_features.drop(columns=['s_id'])
m02_df_features
| x1 | x2 | x3 | x4 | |
|---|---|---|---|---|
| 0 | 50.213596 | 100.053754 | 22.027835 | 13.839467 |
| 1 | 49.811232 | 101.161409 | 22.436839 | 13.878907 |
| 2 | 48.968142 | 106.184030 | 22.414990 | 13.847003 |
| 3 | 50.477133 | 107.949816 | 21.909720 | 14.193081 |
| 4 | 50.188501 | 102.882549 | 22.306728 | 13.693529 |
| ... | ... | ... | ... | ... |
| 5114 | 52.690936 | 95.050804 | 20.324674 | 10.963370 |
| 5115 | 52.348015 | 97.863144 | 20.884779 | 11.156604 |
| 5116 | 52.187241 | 99.887013 | 20.439142 | 11.303222 |
| 5117 | 52.875621 | 101.572057 | 20.648029 | 11.084512 |
| 5118 | 52.491445 | 96.744458 | 20.814762 | 10.886910 |
5119 rows × 4 columns
m02_lf = m02_df_features.reset_index().\
rename(columns={'index': 'rowid'}).\
melt(id_vars=['rowid'])
m02_lf
| rowid | variable | value | |
|---|---|---|---|
| 0 | 0 | x1 | 50.213596 |
| 1 | 1 | x1 | 49.811232 |
| 2 | 2 | x1 | 48.968142 |
| 3 | 3 | x1 | 50.477133 |
| 4 | 4 | x1 | 50.188501 |
| ... | ... | ... | ... |
| 20471 | 5114 | x4 | 10.963370 |
| 20472 | 5115 | x4 | 11.156604 |
| 20473 | 5116 | x4 | 11.303222 |
| 20474 | 5117 | x4 | 11.084512 |
| 20475 | 5118 | x4 | 10.886910 |
20476 rows × 3 columns
m02_objects = m02_df.select_dtypes('object').copy()
m02_objects
| ID | Batch | |
|---|---|---|
| 0 | B001-M02-S001 | 1 |
| 1 | B001-M02-S002 | 1 |
| 2 | B001-M02-S003 | 1 |
| 3 | B001-M02-S004 | 1 |
| 4 | B001-M02-S005 | 1 |
| ... | ... | ... |
| 5114 | B050-M02-S092 | 50 |
| 5115 | B050-M02-S093 | 50 |
| 5116 | B050-M02-S094 | 50 |
| 5117 | B050-M02-S095 | 50 |
| 5118 | B050-M02-S096 | 50 |
5119 rows × 2 columns
id_cols2 = ['rowid'] + m02_objects.columns.to_list()
id_cols2
['rowid', 'ID', 'Batch']
m02_objects_lf = m02_df.reset_index().\
rename(columns={'index': 'rowid'}).\
melt(id_vars=id_cols2, value_vars=m02_df_features.columns)
m02_objects_lf
| rowid | ID | Batch | variable | value | |
|---|---|---|---|---|---|
| 0 | 0 | B001-M02-S001 | 1 | x1 | 50.213596 |
| 1 | 1 | B001-M02-S002 | 1 | x1 | 49.811232 |
| 2 | 2 | B001-M02-S003 | 1 | x1 | 48.968142 |
| 3 | 3 | B001-M02-S004 | 1 | x1 | 50.477133 |
| 4 | 4 | B001-M02-S005 | 1 | x1 | 50.188501 |
| ... | ... | ... | ... | ... | ... |
| 20471 | 5114 | B050-M02-S092 | 50 | x4 | 10.963370 |
| 20472 | 5115 | B050-M02-S093 | 50 | x4 | 11.156604 |
| 20473 | 5116 | B050-M02-S094 | 50 | x4 | 11.303222 |
| 20474 | 5117 | B050-M02-S095 | 50 | x4 | 11.084512 |
| 20475 | 5118 | B050-M02-S096 | 50 | x4 | 10.886910 |
20476 rows × 5 columns
sns.displot(data = m02_lf, x='value', col='variable', kind='hist', kde=True,
facet_kws={'sharex': False, 'sharey': False},
common_bins=False, col_wrap=2)
plt.show()
sns.displot(data = m02_df, x='s_id', kind='hist', kde=True)
plt.show()
sns.catplot(data = m02_objects_lf, x='Batch', y='value', col='variable', col_wrap=1, aspect=3,
kind='box',
sharey=False)
plt.show()
m02_df_features
| x1 | x2 | x3 | x4 | |
|---|---|---|---|---|
| 0 | 50.213596 | 100.053754 | 22.027835 | 13.839467 |
| 1 | 49.811232 | 101.161409 | 22.436839 | 13.878907 |
| 2 | 48.968142 | 106.184030 | 22.414990 | 13.847003 |
| 3 | 50.477133 | 107.949816 | 21.909720 | 14.193081 |
| 4 | 50.188501 | 102.882549 | 22.306728 | 13.693529 |
| ... | ... | ... | ... | ... |
| 5114 | 52.690936 | 95.050804 | 20.324674 | 10.963370 |
| 5115 | 52.348015 | 97.863144 | 20.884779 | 11.156604 |
| 5116 | 52.187241 | 99.887013 | 20.439142 | 11.303222 |
| 5117 | 52.875621 | 101.572057 | 20.648029 | 11.084512 |
| 5118 | 52.491445 | 96.744458 | 20.814762 | 10.886910 |
5119 rows × 4 columns
m02_df_features_batch = m02_df_features.copy()
m02_df_features_batch['Batch'] = m02_df['Batch']
m02_df_features_batch
| x1 | x2 | x3 | x4 | Batch | |
|---|---|---|---|---|---|
| 0 | 50.213596 | 100.053754 | 22.027835 | 13.839467 | 1 |
| 1 | 49.811232 | 101.161409 | 22.436839 | 13.878907 | 1 |
| 2 | 48.968142 | 106.184030 | 22.414990 | 13.847003 | 1 |
| 3 | 50.477133 | 107.949816 | 21.909720 | 14.193081 | 1 |
| 4 | 50.188501 | 102.882549 | 22.306728 | 13.693529 | 1 |
| ... | ... | ... | ... | ... | ... |
| 5114 | 52.690936 | 95.050804 | 20.324674 | 10.963370 | 50 |
| 5115 | 52.348015 | 97.863144 | 20.884779 | 11.156604 | 50 |
| 5116 | 52.187241 | 99.887013 | 20.439142 | 11.303222 | 50 |
| 5117 | 52.875621 | 101.572057 | 20.648029 | 11.084512 | 50 |
| 5118 | 52.491445 | 96.744458 | 20.814762 | 10.886910 | 50 |
5119 rows × 5 columns
sns.pairplot(data = m02_df_features_batch, hue='Batch',
diag_kws={'common_norm': False}, palette='viridis')
plt.show()
corr_per_group_all2 = m02_df_features_batch.groupby('Batch').corr(numeric_only=True)
corr_per_group_all2
| x1 | x2 | x3 | x4 | ||
|---|---|---|---|---|---|
| Batch | |||||
| 1 | x1 | 1.000000 | 0.163656 | -0.217603 | -0.001689 |
| x2 | 0.163656 | 1.000000 | -0.142951 | -0.069450 | |
| x3 | -0.217603 | -0.142951 | 1.000000 | -0.173068 | |
| x4 | -0.001689 | -0.069450 | -0.173068 | 1.000000 | |
| 2 | x1 | 1.000000 | 0.190067 | -0.039772 | -0.118070 |
| ... | ... | ... | ... | ... | ... |
| 49 | x4 | 0.085552 | -0.031279 | -0.256291 | 1.000000 |
| 50 | x1 | 1.000000 | 0.198618 | 0.074470 | -0.037296 |
| x2 | 0.198618 | 1.000000 | 0.017599 | -0.132803 | |
| x3 | 0.074470 | 0.017599 | 1.000000 | -0.040388 | |
| x4 | -0.037296 | -0.132803 | -0.040388 | 1.000000 |
200 rows × 4 columns
the_groups2 = m02_df_features_batch.Batch.unique().tolist()
print(the_groups2)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50]
sublists2 = [the_groups2[i:i + 5] for i in range(0, len(the_groups2), 5)]
for i in sublists2:
fig, axs = plt.subplots(1, len(i), figsize=(18, 8), sharex=True, sharey=True )
for ix in range(len(i)):
sns.heatmap( data = corr_per_group_all2.loc[ i[ ix ] ],
vmin=-1, vmax=1, center = 0,
cmap='coolwarm', cbar=False,
annot=True, annot_kws={'size': 10},
ax=axs[ix] )
axs[ ix ].set_title('Batch: %s' % i[ ix ] )
plt.show()
sns.pairplot(data = m02_df)
plt.show()
m03_df Overview
m03_df.shape
(4458, 7)
m03_df has 4458 rows and 7 columns.
m03_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4458 entries, 0 to 4457 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 4458 non-null object 1 Batch 4458 non-null int64 2 s_id 4458 non-null int64 3 x1 4458 non-null float64 4 x2 4458 non-null float64 5 x3 4458 non-null float64 6 x4 4458 non-null float64 dtypes: float64(4), int64(2), object(1) memory usage: 243.9+ KB
x1, x2, x3, and x4 are float columns; Batch and s_id are integers; ID is an object (string) column. As before, Batch is converted to the object type:
m03_df['Batch'] = m03_df['Batch'].astype('object')
m03_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4458 entries, 0 to 4457 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 4458 non-null object 1 Batch 4458 non-null object 2 s_id 4458 non-null int64 3 x1 4458 non-null float64 4 x2 4458 non-null float64 5 x3 4458 non-null float64 6 x4 4458 non-null float64 dtypes: float64(4), int64(1), object(2) memory usage: 243.9+ KB
m03_df.nunique()
ID 4458 Batch 41 s_id 141 x1 4458 x2 4458 x3 4458 x4 4458 dtype: int64
ID, x1, x2, x3, and x4 have 4458 unique values each; Batch has 41 unique values; s_id has 141 unique values.
m03_df.isna().sum()
ID 0 Batch 0 s_id 0 x1 0 x2 0 x3 0 x4 0 dtype: int64
m03_df Marginal Behavior
m03_df.describe()
| s_id | x1 | x2 | x3 | x4 | |
|---|---|---|---|---|---|
| count | 4458.000000 | 4458.000000 | 4458.000000 | 4458.000000 | 4458.000000 |
| mean | 56.475774 | 51.857672 | 101.650057 | 23.927679 | 10.602371 |
| std | 33.827427 | 3.022646 | 12.243012 | 2.035196 | 1.554708 |
| min | 1.000000 | 45.690397 | 70.192150 | 19.865338 | 8.155668 |
| 25% | 28.000000 | 49.228020 | 95.381831 | 22.457249 | 9.475467 |
| 50% | 55.000000 | 52.598712 | 104.712727 | 23.900193 | 10.351781 |
| 75% | 83.000000 | 54.810528 | 110.050211 | 25.142293 | 11.301126 |
| max | 141.000000 | 56.981029 | 126.332105 | 28.972747 | 14.182623 |
m03_df.describe(include=object)
| ID | Batch | |
|---|---|---|
| count | 4458 | 4458 |
| unique | 4458 | 41 |
| top | B001-M03-S001 | 5 |
| freq | 1 | 141 |
sns.catplot(data = m03_df, x='ID', kind='count', aspect=3)
plt.show()
sns.catplot(data = m03_df, x='Batch', kind='count', aspect=3)
plt.show()
There are 41 unique Batch values associated with Machine 3; unlike Machines 1 and 2, not every batch from 1 to 50 appears.
m03_df Relationships
m03_df_features = m03_df.select_dtypes('number').copy()
m03_df_features = m03_df_features.drop(columns=['s_id'])
m03_df_features
| x1 | x2 | x3 | x4 | |
|---|---|---|---|---|
| 0 | 49.317142 | 103.793803 | 21.848400 | 13.539224 |
| 1 | 49.983523 | 106.266817 | 22.530393 | 13.741064 |
| 2 | 48.762702 | 98.864632 | 22.462081 | 13.818300 |
| 3 | 50.009580 | 104.689498 | 22.369850 | 13.849127 |
| 4 | 48.892358 | 102.974600 | 22.657941 | 13.795078 |
| ... | ... | ... | ... | ... |
| 4453 | 49.278415 | 102.089925 | 22.396979 | 13.753264 |
| 4454 | 49.264754 | 104.619874 | 21.977716 | 13.636539 |
| 4455 | 49.465358 | 102.867473 | 22.257845 | 13.678206 |
| 4456 | 49.751785 | 104.863427 | 22.461204 | 13.880084 |
| 4457 | 49.819212 | 104.075176 | 22.462165 | 13.853014 |
4458 rows × 4 columns
m03_lf = m03_df_features.reset_index().\
rename(columns={'index': 'rowid'}).\
melt(id_vars=['rowid'])
m03_lf
| rowid | variable | value | |
|---|---|---|---|
| 0 | 0 | x1 | 49.317142 |
| 1 | 1 | x1 | 49.983523 |
| 2 | 2 | x1 | 48.762702 |
| 3 | 3 | x1 | 50.009580 |
| 4 | 4 | x1 | 48.892358 |
| ... | ... | ... | ... |
| 17827 | 4453 | x4 | 13.753264 |
| 17828 | 4454 | x4 | 13.636539 |
| 17829 | 4455 | x4 | 13.678206 |
| 17830 | 4456 | x4 | 13.880084 |
| 17831 | 4457 | x4 | 13.853014 |
17832 rows × 3 columns
m03_objects = m03_df.select_dtypes('object').copy()
m03_objects
| ID | Batch | |
|---|---|---|
| 0 | B001-M03-S001 | 1 |
| 1 | B001-M03-S002 | 1 |
| 2 | B001-M03-S003 | 1 |
| 3 | B001-M03-S004 | 1 |
| 4 | B001-M03-S005 | 1 |
| ... | ... | ... |
| 4453 | B049-M03-S100 | 49 |
| 4454 | B049-M03-S101 | 49 |
| 4455 | B049-M03-S102 | 49 |
| 4456 | B049-M03-S103 | 49 |
| 4457 | B049-M03-S104 | 49 |
4458 rows × 2 columns
id_cols3 = ['rowid'] + m03_objects.columns.to_list()
id_cols3
['rowid', 'ID', 'Batch']
m03_objects_lf = m03_df.reset_index().\
rename(columns={'index': 'rowid'}).\
melt(id_vars=id_cols3, value_vars=m03_df_features.columns)
m03_objects_lf
| rowid | ID | Batch | variable | value | |
|---|---|---|---|---|---|
| 0 | 0 | B001-M03-S001 | 1 | x1 | 49.317142 |
| 1 | 1 | B001-M03-S002 | 1 | x1 | 49.983523 |
| 2 | 2 | B001-M03-S003 | 1 | x1 | 48.762702 |
| 3 | 3 | B001-M03-S004 | 1 | x1 | 50.009580 |
| 4 | 4 | B001-M03-S005 | 1 | x1 | 48.892358 |
| ... | ... | ... | ... | ... | ... |
| 17827 | 4453 | B049-M03-S100 | 49 | x4 | 13.753264 |
| 17828 | 4454 | B049-M03-S101 | 49 | x4 | 13.636539 |
| 17829 | 4455 | B049-M03-S102 | 49 | x4 | 13.678206 |
| 17830 | 4456 | B049-M03-S103 | 49 | x4 | 13.880084 |
| 17831 | 4457 | B049-M03-S104 | 49 | x4 | 13.853014 |
17832 rows × 5 columns
sns.displot(data = m03_lf, x='value', col='variable', kind='hist', kde=True,
facet_kws={'sharex': False, 'sharey': False},
common_bins=False, col_wrap=2)
plt.show()
sns.displot(data = m03_df, x='s_id', kind='hist', kde=True)
plt.show()
sns.catplot(data = m03_objects_lf, x='Batch', y='value', col='variable', col_wrap=1, aspect=3,
kind='box',
sharey=False)
plt.show()
m03_df_features
| x1 | x2 | x3 | x4 | |
|---|---|---|---|---|
| 0 | 49.317142 | 103.793803 | 21.848400 | 13.539224 |
| 1 | 49.983523 | 106.266817 | 22.530393 | 13.741064 |
| 2 | 48.762702 | 98.864632 | 22.462081 | 13.818300 |
| 3 | 50.009580 | 104.689498 | 22.369850 | 13.849127 |
| 4 | 48.892358 | 102.974600 | 22.657941 | 13.795078 |
| ... | ... | ... | ... | ... |
| 4453 | 49.278415 | 102.089925 | 22.396979 | 13.753264 |
| 4454 | 49.264754 | 104.619874 | 21.977716 | 13.636539 |
| 4455 | 49.465358 | 102.867473 | 22.257845 | 13.678206 |
| 4456 | 49.751785 | 104.863427 | 22.461204 | 13.880084 |
| 4457 | 49.819212 | 104.075176 | 22.462165 | 13.853014 |
4458 rows × 4 columns
m03_df_features_batch = m03_df_features.copy()
m03_df_features_batch['Batch'] = m03_df['Batch']
m03_df_features_batch
| x1 | x2 | x3 | x4 | Batch | |
|---|---|---|---|---|---|
| 0 | 49.317142 | 103.793803 | 21.848400 | 13.539224 | 1 |
| 1 | 49.983523 | 106.266817 | 22.530393 | 13.741064 | 1 |
| 2 | 48.762702 | 98.864632 | 22.462081 | 13.818300 | 1 |
| 3 | 50.009580 | 104.689498 | 22.369850 | 13.849127 | 1 |
| 4 | 48.892358 | 102.974600 | 22.657941 | 13.795078 | 1 |
| ... | ... | ... | ... | ... | ... |
| 4453 | 49.278415 | 102.089925 | 22.396979 | 13.753264 | 49 |
| 4454 | 49.264754 | 104.619874 | 21.977716 | 13.636539 | 49 |
| 4455 | 49.465358 | 102.867473 | 22.257845 | 13.678206 | 49 |
| 4456 | 49.751785 | 104.863427 | 22.461204 | 13.880084 | 49 |
| 4457 | 49.819212 | 104.075176 | 22.462165 | 13.853014 | 49 |
4458 rows × 5 columns
sns.pairplot(data = m03_df_features_batch, hue='Batch',
diag_kws={'common_norm': False}, palette='viridis')
plt.show()
corr_per_group_all3 = m03_df_features_batch.groupby('Batch').corr(numeric_only=True)
corr_per_group_all3
| x1 | x2 | x3 | x4 | ||
|---|---|---|---|---|---|
| Batch | |||||
| 1 | x1 | 1.000000 | 0.760277 | 0.127774 | -0.027709 |
| x2 | 0.760277 | 1.000000 | 0.096853 | 0.016254 | |
| x3 | 0.127774 | 0.096853 | 1.000000 | 0.744956 | |
| x4 | -0.027709 | 0.016254 | 0.744956 | 1.000000 | |
| 2 | x1 | 1.000000 | 0.685391 | -0.004765 | 0.046219 |
| ... | ... | ... | ... | ... | ... |
| 48 | x4 | 0.064493 | -0.042554 | 0.764414 | 1.000000 |
| 49 | x1 | 1.000000 | 0.792663 | -0.032332 | -0.095601 |
| x2 | 0.792663 | 1.000000 | -0.031371 | -0.091763 | |
| x3 | -0.032332 | -0.031371 | 1.000000 | 0.745093 | |
| x4 | -0.095601 | -0.091763 | 0.745093 | 1.000000 |
164 rows × 4 columns
the_groups3 = m03_df_features_batch.Batch.unique().tolist()
print(the_groups3)
[1, 2, 3, 4, 5, 6, 9, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 24, 26, 27, 28, 29, 31, 32, 33, 34, 35, 36, 37, 38, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
len(the_groups3)
41
# chunk all but the final group into sublists of 5; the last group is plotted on its own below
sublists3 = [the_groups3[i:i + 5] for i in range(0, len(the_groups3) - 1, 5)]
for i in sublists3:
fig, axs = plt.subplots(1, len(i), figsize=(18, 8), sharex=True, sharey=True )
for ix in range(len(i)):
sns.heatmap( data = corr_per_group_all3.loc[ i[ ix ] ],
vmin=-1, vmax=1, center = 0,
cmap='coolwarm', cbar=False,
annot=True, annot_kws={'size': 10},
ax=axs[ix] )
axs[ ix ].set_title('Batch: %s' % i[ ix ] )
plt.show()
# the final group, plotted on its own axes since plt.subplots with a single Axes is not an array
sublists31 = the_groups3[-1]
sublists31
49
fig, axs = plt.subplots(figsize=(10, 6) )
sns.heatmap( data = corr_per_group_all3.loc[ sublists31 ],
vmin=-1, vmax=1, center = 0,
cmap='coolwarm', cbar=False,
annot=True, annot_kws={'size': 10},
ax=axs )
axs.set_title('Batch: %s' % sublists31 )
plt.show()
m01_df['machine_id'] = 1
m02_df['machine_id'] = 2
m03_df['machine_id'] = 3
# concatenate the 3 DataFrames into a single DataFrame which includes the `machine_id` variable
machine_df = pd.concat([m01_df, m02_df, m03_df])
machine_df
| ID | Batch | s_id | x1 | x2 | x3 | x4 | machine_id | |
|---|---|---|---|---|---|---|---|---|
| 0 | B001-M01-S001 | 1 | 1 | 50.117118 | 102.167346 | 22.067812 | 13.889524 | 1 |
| 1 | B001-M01-S002 | 1 | 2 | 46.887333 | 112.266102 | 23.197330 | 13.064709 | 1 |
| 2 | B001-M01-S003 | 1 | 3 | 50.132744 | 103.674908 | 22.319933 | 13.647482 | 1 |
| 3 | B001-M01-S004 | 1 | 4 | 48.501042 | 107.143156 | 22.162947 | 14.077758 | 1 |
| 4 | B001-M01-S005 | 1 | 5 | 49.690442 | 102.120283 | 22.248696 | 13.728666 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4453 | B049-M03-S100 | 49 | 100 | 49.278415 | 102.089925 | 22.396979 | 13.753264 | 3 |
| 4454 | B049-M03-S101 | 49 | 101 | 49.264754 | 104.619874 | 21.977716 | 13.636539 | 3 |
| 4455 | B049-M03-S102 | 49 | 102 | 49.465358 | 102.867473 | 22.257845 | 13.678206 | 3 |
| 4456 | B049-M03-S103 | 49 | 103 | 49.751785 | 104.863427 | 22.461204 | 13.880084 | 3 |
| 4457 | B049-M03-S104 | 49 | 104 | 49.819212 | 104.075176 | 22.462165 | 13.853014 | 3 |
14729 rows × 8 columns
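One caution about pd.concat: by default it keeps each frame's original row index, so machine_df above contains repeated index labels (0, 1, 2, ... appear once per machine). If a unique index is needed later, ignore_index=True rebuilds a clean RangeIndex. A minimal sketch with toy frames:

```python
import pandas as pd

df_a = pd.DataFrame({'x1': [1.0, 2.0]})
df_b = pd.DataFrame({'x1': [3.0, 4.0]})

# Default concat reuses each frame's index, so labels 0 and 1 repeat
stacked = pd.concat([df_a, df_b])
print(stacked.index.is_unique)  # False

# ignore_index=True relabels the rows 0..n-1
stacked_clean = pd.concat([df_a, df_b], ignore_index=True)
print(stacked_clean.index.is_unique)  # True
```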
listx = ['x1', 'x2', 'x3', 'x4']
for i in listx:
sns.catplot(data=machine_df, x='machine_id', y=i, kind='box')
plt.show()
groups4 = machine_df.machine_id.unique().tolist()
groups4
[1, 2, 3]
corr_per_group4 = machine_df.loc[ :, ['x1', 'x2', 'x3', 'x4', 'machine_id']].groupby(['machine_id']).corr()
corr_per_group4
| x1 | x2 | x3 | x4 | ||
|---|---|---|---|---|---|
| machine_id | |||||
| 1 | x1 | 1.000000 | 0.725695 | 0.117574 | -0.191357 |
| x2 | 0.725695 | 1.000000 | 0.064586 | 0.091319 | |
| x3 | 0.117574 | 0.064586 | 1.000000 | -0.584667 | |
| x4 | -0.191357 | 0.091319 | -0.584667 | 1.000000 | |
| 2 | x1 | 1.000000 | 0.850194 | 0.110249 | -0.187127 |
| x2 | 0.850194 | 1.000000 | 0.056123 | 0.102422 | |
| x3 | 0.110249 | 0.056123 | 1.000000 | -0.569695 | |
| x4 | -0.187127 | 0.102422 | -0.569695 | 1.000000 | |
| 3 | x1 | 1.000000 | 0.883526 | 0.093064 | -0.199686 |
| x2 | 0.883526 | 1.000000 | -0.024685 | 0.114277 | |
| x3 | 0.093064 | -0.024685 | 1.000000 | -0.573415 | |
| x4 | -0.199686 | 0.114277 | -0.573415 | 1.000000 |
fig, axs = plt.subplots(1, len(groups4), figsize=(10, 3), sharex=True, sharey=True)
for ix in range(len(groups4)):
    sns.heatmap(data=corr_per_group4.loc[groups4[ix]],
                vmin=-1, vmax=1, center=0,
                cmap='coolwarm', cbar=False,
                annot=True, annot_kws={'size': 6},
                ax=axs[ix])
    axs[ix].set_title('machine_id: %s' % groups4[ix])
plt.show()
listx = ['x1', 'x2', 'x3', 'x4']
for i in listx:
    sns.displot(data=machine_df, x=i, hue='Batch', kind='kde')
    plt.show()
The supplier batch data set file name is provided for you below. You must read in the CSV file and assign the data set to the batch_df object.
You must perform the following ESSENTIAL activities:
You must visually explore the MARGINAL behavior of the variables in the data. You must use visualizations appropriate for the DATA TYPE of the columns.
You must visually explore RELATIONSHIPS between variables in the data. You must use visualizations appropriate for the DATA TYPES. You must make sure that your visualizations can answer the following questions:
- Does the Density depend on the Supplier?
- How does Density relate to Batch for each Supplier?

After exploring the batch_df DataFrame, you MUST JOIN/MERGE the batch_df DataFrame with the machine_df DataFrame. Assign the merged DataFrame to the dfa DataFrame.
You can now explore the relationships between the MACHINE OPERATIONAL VARIABLES and the SUPPLIERS! You must use visualizations to explore the following relationships:
- How x1 through x4 vary across Batch for each MACHINE given each Supplier. Your figures MUST use Batch as the x-axis variable.
- How x1 through x4 vary across Supplier.

You may add as many markdown and code cells as you see fit to answer this question.
# define the batch supplier file
batch_file = 'midterm_supplier.csv'
# read in the batch supplier data set
batch_df = pd.read_csv(batch_file)
batch_df
| Batch | Supplier | Density | |
|---|---|---|---|
| 0 | 1 | B | 10.388587 |
| 1 | 2 | A | 9.296880 |
| 2 | 3 | A | 9.565636 |
| 3 | 4 | B | 10.005387 |
| 4 | 5 | A | 9.339177 |
| 5 | 6 | A | 12.246089 |
| 6 | 7 | A | 9.695526 |
| 7 | 8 | A | 9.966524 |
| 8 | 9 | B | 9.394692 |
| 9 | 10 | B | 9.289811 |
| 10 | 11 | A | 9.176533 |
| 11 | 12 | A | 9.095899 |
| 12 | 13 | A | 9.500716 |
| 13 | 14 | B | 10.089895 |
| 14 | 15 | B | 10.062224 |
| 15 | 16 | B | 10.083865 |
| 16 | 17 | A | 10.398586 |
| 17 | 18 | A | 10.943962 |
| 18 | 19 | A | 8.703627 |
| 19 | 20 | B | 9.720199 |
| 20 | 21 | B | 9.043240 |
| 21 | 22 | A | 9.057283 |
| 22 | 23 | B | 10.049708 |
| 23 | 24 | B | 10.234062 |
| 24 | 25 | A | 11.963966 |
| 25 | 26 | B | 10.265196 |
| 26 | 27 | B | 9.901614 |
| 27 | 28 | A | 10.499405 |
| 28 | 29 | A | 8.756711 |
| 29 | 30 | B | 10.182992 |
| 30 | 31 | A | 10.179347 |
| 31 | 32 | A | 9.243669 |
| 32 | 33 | A | 9.293373 |
| 33 | 34 | A | 9.034611 |
| 34 | 35 | B | 10.045503 |
| 35 | 36 | A | 9.341639 |
| 36 | 37 | A | 8.967161 |
| 37 | 38 | A | 11.589538 |
| 38 | 39 | A | 10.284600 |
| 39 | 40 | A | 9.420988 |
| 40 | 41 | A | 9.558478 |
| 41 | 42 | B | 9.664472 |
| 42 | 43 | A | 9.624945 |
| 43 | 44 | A | 9.090145 |
| 44 | 45 | B | 9.363151 |
| 45 | 46 | B | 9.729321 |
| 46 | 47 | A | 10.862295 |
| 47 | 48 | B | 9.638677 |
| 48 | 49 | A | 9.111256 |
| 49 | 50 | B | 9.935676 |
batch_df.shape
(50, 3)
batch_df has 50 rows and 3 columns.
batch_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50 entries, 0 to 49 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Batch 50 non-null int64 1 Supplier 50 non-null object 2 Density 50 non-null float64 dtypes: float64(1), int64(1), object(1) memory usage: 1.3+ KB
Batch is integer data type. Supplier is object data type. Density is float data type.
batch_df.nunique()
Batch 50 Supplier 2 Density 50 dtype: int64
Batch and Density have 50 unique values each. Supplier has 2 unique values.
batch_df.isna().sum()
Batch 0 Supplier 0 Density 0 dtype: int64
batch_df['Batch'] = batch_df['Batch'].astype('object')
batch_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50 entries, 0 to 49 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Batch 50 non-null object 1 Supplier 50 non-null object 2 Density 50 non-null float64 dtypes: float64(1), object(2) memory usage: 1.3+ KB
batch_df Marginal Behavior
sns.catplot( data = batch_df, x='Batch', kind='count', aspect=3 )
plt.show()
sns.catplot( data = batch_df, x='Supplier', kind='count' )
plt.show()
sns.displot(data = batch_df, x='Density', kind='hist', kde=True)
plt.show()
batch_df Relationships
sns.displot(data = batch_df, x='Density', hue='Supplier',
common_norm=False, kind='kde')
plt.show()
sns.catplot(data = batch_df, x='Supplier', y='Density', kind='box',
showmeans=True,
meanprops={'marker': 'o', 'markerfacecolor': 'white', 'markeredgecolor': 'black'})
plt.show()
sns.catplot(data = batch_df, x='Supplier', y='Density', kind='point', aspect=2, join=False)
plt.show()
Density does depend on the Supplier.
batch_df['Batch'] = batch_df.Batch.astype('int64')
sns.lmplot(data = batch_df, x = 'Batch', y = 'Density', hue = 'Supplier')
plt.show()
batch_df.head()
| Batch | Supplier | Density | |
|---|---|---|---|
| 0 | 1 | B | 10.388587 |
| 1 | 2 | A | 9.296880 |
| 2 | 3 | A | 9.565636 |
| 3 | 4 | B | 10.005387 |
| 4 | 5 | A | 9.339177 |
machine_df.head()
| ID | Batch | s_id | x1 | x2 | x3 | x4 | machine_id | |
|---|---|---|---|---|---|---|---|---|
| 0 | B001-M01-S001 | 1 | 1 | 50.117118 | 102.167346 | 22.067812 | 13.889524 | 1 |
| 1 | B001-M01-S002 | 1 | 2 | 46.887333 | 112.266102 | 23.197330 | 13.064709 | 1 |
| 2 | B001-M01-S003 | 1 | 3 | 50.132744 | 103.674908 | 22.319933 | 13.647482 | 1 |
| 3 | B001-M01-S004 | 1 | 4 | 48.501042 | 107.143156 | 22.162947 | 14.077758 | 1 |
| 4 | B001-M01-S005 | 1 | 5 | 49.690442 | 102.120283 | 22.248696 | 13.728666 | 1 |
# merge the batch supplier data set with the (concatenated) machine data set
dfa = pd.merge( batch_df, machine_df, on='Batch' )
dfa
| Batch | Supplier | Density | ID | s_id | x1 | x2 | x3 | x4 | machine_id | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B | 10.388587 | B001-M01-S001 | 1 | 50.117118 | 102.167346 | 22.067812 | 13.889524 | 1 |
| 1 | 1 | B | 10.388587 | B001-M01-S002 | 2 | 46.887333 | 112.266102 | 23.197330 | 13.064709 | 1 |
| 2 | 1 | B | 10.388587 | B001-M01-S003 | 3 | 50.132744 | 103.674908 | 22.319933 | 13.647482 | 1 |
| 3 | 1 | B | 10.388587 | B001-M01-S004 | 4 | 48.501042 | 107.143156 | 22.162947 | 14.077758 | 1 |
| 4 | 1 | B | 10.388587 | B001-M01-S005 | 5 | 49.690442 | 102.120283 | 22.248696 | 13.728666 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14724 | 50 | B | 9.935676 | B050-M02-S092 | 92 | 52.690936 | 95.050804 | 20.324674 | 10.963370 | 2 |
| 14725 | 50 | B | 9.935676 | B050-M02-S093 | 93 | 52.348015 | 97.863144 | 20.884779 | 11.156604 | 2 |
| 14726 | 50 | B | 9.935676 | B050-M02-S094 | 94 | 52.187241 | 99.887013 | 20.439142 | 11.303222 | 2 |
| 14727 | 50 | B | 9.935676 | B050-M02-S095 | 95 | 52.875621 | 101.572057 | 20.648029 | 11.084512 | 2 |
| 14728 | 50 | B | 9.935676 | B050-M02-S096 | 96 | 52.491445 | 96.744458 | 20.814762 | 10.886910 | 2 |
14729 rows × 10 columns
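The merged `dfa` has the same 14729 rows as `machine_df`, which is what a one-to-many merge on `Batch` should produce (each batch-level row fans out across its machine samples). A sketch, with hypothetical miniature frames, of how `validate` and `indicator` can confirm this:

```python
import pandas as pd

# Hypothetical miniature stand-ins: one row per Batch on the left
# (like batch_df), many rows per Batch on the right (like machine_df).
batch = pd.DataFrame({'Batch': [1, 2], 'Supplier': ['B', 'A']})
machine = pd.DataFrame({'Batch': [1, 1, 2], 'x1': [50.1, 46.9, 49.5]})

# validate='one_to_many' raises MergeError if Batch were duplicated on the
# left; indicator=True adds a '_merge' column recording each row's origin.
merged = pd.merge(batch, machine, on='Batch',
                  validate='one_to_many', indicator=True)
```

Every `_merge` value being `'both'` confirms no batch or machine rows were silently dropped by the join.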
dfa.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 14729 entries, 0 to 14728 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Batch 14729 non-null object 1 Supplier 14729 non-null object 2 Density 14729 non-null float64 3 ID 14729 non-null object 4 s_id 14729 non-null int64 5 x1 14729 non-null float64 6 x2 14729 non-null float64 7 x3 14729 non-null float64 8 x4 14729 non-null float64 9 machine_id 14729 non-null int64 dtypes: float64(5), int64(2), object(3) memory usage: 1.2+ MB
dfa.describe()
| Density | s_id | x1 | x2 | x3 | x4 | machine_id | |
|---|---|---|---|---|---|---|---|
| count | 14729.000000 | 14729.000000 | 14729.000000 | 14729.000000 | 14729.000000 | 14729.000000 | 14729.000000 |
| mean | 9.735339 | 54.505465 | 51.952606 | 101.966113 | 23.953500 | 10.612497 | 1.952882 |
| std | 0.729389 | 32.927545 | 3.093916 | 12.678550 | 2.203383 | 1.539387 | 0.806398 |
| min | 8.703627 | 1.000000 | 44.275379 | 64.774007 | 19.095832 | 7.644190 | 1.000000 |
| 25% | 9.289811 | 27.000000 | 49.257492 | 96.189442 | 22.394977 | 9.481281 | 1.000000 |
| 50% | 9.624945 | 53.000000 | 52.586639 | 104.533114 | 23.883004 | 10.344656 | 2.000000 |
| 75% | 10.083865 | 80.000000 | 54.757788 | 110.415780 | 25.343060 | 11.300265 | 3.000000 |
| max | 12.246089 | 149.000000 | 58.647186 | 131.009046 | 29.477213 | 14.612911 | 3.000000 |
dfa.describe(include=object)
| Batch | Supplier | ID | |
|---|---|---|---|
| count | 14729 | 14729 | 14729 |
| unique | 50 | 2 | 14729 |
| top | 12 | A | B001-M01-S001 |
| freq | 380 | 9140 | 1 |
dfa_copy = dfa[['x1', 'x2', 'x3', 'x4', 'Batch', 'Supplier']].copy()
dfa_copy
| x1 | x2 | x3 | x4 | Batch | Supplier | |
|---|---|---|---|---|---|---|
| 0 | 50.117118 | 102.167346 | 22.067812 | 13.889524 | 1 | B |
| 1 | 46.887333 | 112.266102 | 23.197330 | 13.064709 | 1 | B |
| 2 | 50.132744 | 103.674908 | 22.319933 | 13.647482 | 1 | B |
| 3 | 48.501042 | 107.143156 | 22.162947 | 14.077758 | 1 | B |
| 4 | 49.690442 | 102.120283 | 22.248696 | 13.728666 | 1 | B |
| ... | ... | ... | ... | ... | ... | ... |
| 14724 | 52.690936 | 95.050804 | 20.324674 | 10.963370 | 50 | B |
| 14725 | 52.348015 | 97.863144 | 20.884779 | 11.156604 | 50 | B |
| 14726 | 52.187241 | 99.887013 | 20.439142 | 11.303222 | 50 | B |
| 14727 | 52.875621 | 101.572057 | 20.648029 | 11.084512 | 50 | B |
| 14728 | 52.491445 | 96.744458 | 20.814762 | 10.886910 | 50 | B |
14729 rows × 6 columns
dfa_copy.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 14729 entries, 0 to 14728 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 x1 14729 non-null float64 1 x2 14729 non-null float64 2 x3 14729 non-null float64 3 x4 14729 non-null float64 4 Batch 14729 non-null object 5 Supplier 14729 non-null object dtypes: float64(4), object(2) memory usage: 805.5+ KB
Operating variables across Batch given a Supplier.
sns.catplot(data = dfa_copy, x='Batch', y='x1', hue='Supplier', kind='box', aspect=3,
showmeans=True,
meanprops={'marker': 'o', 'markerfacecolor': 'white', 'markeredgecolor': 'black'})
plt.show()
sns.catplot(data = dfa_copy, x='Batch', y='x1', row='Supplier', kind='box', aspect=2,
showmeans=True,
meanprops={'marker': 'o', 'markerfacecolor': 'white', 'markeredgecolor': 'black'})
plt.show()
sns.catplot(data = dfa_copy, x='Batch', y='x2', hue='Supplier', kind='box', aspect=3,
showmeans=True,
meanprops={'marker': 'o', 'markerfacecolor': 'white', 'markeredgecolor': 'black'})
plt.show()
sns.catplot(data = dfa_copy, x='Batch', y='x2', row='Supplier', kind='box', aspect=2,
showmeans=True,
meanprops={'marker': 'o', 'markerfacecolor': 'white', 'markeredgecolor': 'black'})
plt.show()
sns.catplot(data = dfa_copy, x='Batch', y='x3', hue='Supplier', kind='box', aspect=3,
showmeans=True,
meanprops={'marker': 'o', 'markerfacecolor': 'white', 'markeredgecolor': 'black'})
plt.show()
sns.catplot(data = dfa_copy, x='Batch', y='x3', row='Supplier', kind='box', aspect=2,
showmeans=True,
meanprops={'marker': 'o', 'markerfacecolor': 'white', 'markeredgecolor': 'black'})
plt.show()
sns.catplot(data = dfa_copy, x='Batch', y='x4', hue='Supplier', kind='box', aspect=3,
showmeans=True,
meanprops={'marker': 'o', 'markerfacecolor': 'white', 'markeredgecolor': 'black'})
plt.show()
sns.catplot(data = dfa_copy, x='Batch', y='x4', row='Supplier', kind='box', aspect=2,
showmeans=True,
meanprops={'marker': 'o', 'markerfacecolor': 'white', 'markeredgecolor': 'black'})
plt.show()
Operating variables across Supplier.
dfa_copy_copy = dfa_copy[['x1', 'x2', 'x3', 'x4', 'Supplier']].copy()
dfa_copy_copy
| x1 | x2 | x3 | x4 | Supplier | |
|---|---|---|---|---|---|
| 0 | 50.117118 | 102.167346 | 22.067812 | 13.889524 | B |
| 1 | 46.887333 | 112.266102 | 23.197330 | 13.064709 | B |
| 2 | 50.132744 | 103.674908 | 22.319933 | 13.647482 | B |
| 3 | 48.501042 | 107.143156 | 22.162947 | 14.077758 | B |
| 4 | 49.690442 | 102.120283 | 22.248696 | 13.728666 | B |
| ... | ... | ... | ... | ... | ... |
| 14724 | 52.690936 | 95.050804 | 20.324674 | 10.963370 | B |
| 14725 | 52.348015 | 97.863144 | 20.884779 | 11.156604 | B |
| 14726 | 52.187241 | 99.887013 | 20.439142 | 11.303222 | B |
| 14727 | 52.875621 | 101.572057 | 20.648029 | 11.084512 | B |
| 14728 | 52.491445 | 96.744458 | 20.814762 | 10.886910 | B |
14729 rows × 5 columns
sns.pairplot(data=dfa_copy_copy, hue='Supplier')
plt.show()
The DROP TEST result data set file name is provided for you below. You must read in the CSV file and assign the data set to the test_df object.
You must perform the following ESSENTIAL activities:
You must visually explore the MARGINAL behavior of the variables in the data. You must use visualizations appropriate for the DATA TYPE of the columns.
You must visually explore RELATIONSHIPS between variables in the data. You must use visualizations appropriate for the DATA TYPES. You must make sure that your visualizations can answer the following questions:
- How often each Result occurs for each test_group_id value.

After exploring the test_df DataFrame, you MUST JOIN/MERGE the test_df DataFrame with the dfa DataFrame. Assign the merged DataFrame to the dfb DataFrame. You MUST answer the following:
You may add as many markdown and code cells as you see fit to answer this question.
# define the test data set file name
test_file = 'midterm_test.csv'
# read in the test data set
test_df = pd.read_csv(test_file)
test_df
| ID | test_group_id | Result | |
|---|---|---|---|
| 0 | B001-M01-S056 | A-0 | 1 |
| 1 | B001-M01-S002 | A-0 | 1 |
| 2 | B001-M01-S049 | A-0 | 1 |
| 3 | B001-M01-S030 | A-0 | 0 |
| 4 | B001-M01-S048 | A-0 | 1 |
| ... | ... | ... | ... |
| 1407 | B049-M03-S026 | K-5 | 1 |
| 1408 | B049-M03-S008 | K-5 | 1 |
| 1409 | B049-M03-S041 | K-5 | 1 |
| 1410 | B049-M03-S061 | K-5 | 1 |
| 1411 | B049-M03-S051 | K-5 | 1 |
1412 rows × 3 columns
test_df Essentials
test_df.shape
(1412, 3)
test_df has 1412 rows and 3 columns.
test_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1412 entries, 0 to 1411 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 1412 non-null object 1 test_group_id 1412 non-null object 2 Result 1412 non-null int64 dtypes: int64(1), object(2) memory usage: 33.2+ KB
ID and test_group_id are of object data type. Result is of integer data type.
test_df.nunique()
ID 1412 test_group_id 141 Result 2 dtype: int64
ID has 1412 unique values. test_group_id has 141 unique values. Result has 2 unique values.
test_df.isna().sum()
ID 0 test_group_id 0 Result 0 dtype: int64
test_df Marginal Behavior
test_df.describe()
| Result | |
|---|---|
| count | 1412.000000 |
| mean | 0.301700 |
| std | 0.459158 |
| min | 0.000000 |
| 25% | 0.000000 |
| 50% | 0.000000 |
| 75% | 1.000000 |
| max | 1.000000 |
test_df.describe(include=object)
| ID | test_group_id | |
|---|---|---|
| count | 1412 | 1412 |
| unique | 1412 | 141 |
| top | B001-M01-S056 | L-4 |
| freq | 1 | 14 |
sns.catplot( data = test_df, x='Result', kind='count' )
plt.show()
sns.catplot( data = test_df, x='ID', kind='count', aspect=5 )
plt.show()
sns.catplot( data = test_df, y='test_group_id', kind='count', height=18 )
plt.show()
test_df Relationships
sns.catplot(data = test_df, y='test_group_id', hue='Result', height=18, kind='count')
plt.show()
sns.catplot(data = test_df, x='test_group_id', row='Result', kind='count', aspect=5)
plt.show()
fig, ax = plt.subplots(figsize=(3,28))
sns.heatmap( pd.crosstab( test_df.test_group_id, test_df.Result ), ax = ax,
annot=True, annot_kws={'size': 10}, fmt='d')
plt.show()
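The `pd.crosstab` call above tabulates raw counts of each `Result` per `test_group_id`; passing `normalize='index'` instead would turn each row into within-group proportions. A small sketch with hypothetical miniature test results:

```python
import pandas as pd

# Hypothetical miniature test results mirroring test_df's columns.
t = pd.DataFrame({'test_group_id': ['A-0', 'A-0', 'A-0', 'K-5'],
                  'Result':        [1,     0,     1,     1]})

counts = pd.crosstab(t.test_group_id, t.Result)  # raw counts per group
rates = pd.crosstab(t.test_group_id, t.Result,
                    normalize='index')           # row-wise proportions
```

The normalized form makes groups of different sizes directly comparable on the heatmap.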
The heatmap shows the counts of each Result for each test_group_id.
# merge test_df with the dfa object
dfb = pd.merge( test_df, dfa, on='ID' )
dfb
| ID | test_group_id | Result | Batch | Supplier | Density | s_id | x1 | x2 | x3 | x4 | machine_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B001-M01-S056 | A-0 | 1 | 1 | B | 10.388587 | 56 | 49.215485 | 105.647327 | 21.410531 | 14.154576 | 1 |
| 1 | B001-M01-S002 | A-0 | 1 | 1 | B | 10.388587 | 2 | 46.887333 | 112.266102 | 23.197330 | 13.064709 | 1 |
| 2 | B001-M01-S049 | A-0 | 1 | 1 | B | 10.388587 | 49 | 49.125629 | 112.323042 | 22.874947 | 13.743788 | 1 |
| 3 | B001-M01-S030 | A-0 | 0 | 1 | B | 10.388587 | 30 | 48.306819 | 109.445722 | 22.910858 | 13.839830 | 1 |
| 4 | B001-M01-S048 | A-0 | 1 | 1 | B | 10.388587 | 48 | 50.564504 | 109.184587 | 22.083064 | 13.954942 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1407 | B049-M03-S026 | K-5 | 1 | 49 | A | 9.111256 | 26 | 49.464765 | 102.967447 | 22.411845 | 13.919089 | 3 |
| 1408 | B049-M03-S008 | K-5 | 1 | 49 | A | 9.111256 | 8 | 49.408191 | 103.516814 | 21.966617 | 13.730119 | 3 |
| 1409 | B049-M03-S041 | K-5 | 1 | 49 | A | 9.111256 | 41 | 49.605196 | 103.460366 | 21.932429 | 13.790280 | 3 |
| 1410 | B049-M03-S061 | K-5 | 1 | 49 | A | 9.111256 | 61 | 49.716703 | 104.346466 | 22.059022 | 13.543388 | 3 |
| 1411 | B049-M03-S051 | K-5 | 1 | 49 | A | 9.111256 | 51 | 49.885679 | 104.690007 | 22.055338 | 13.882302 | 3 |
1412 rows × 12 columns
dfb.shape
(1412, 12)
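The drop from 14729 rows in `dfa` to 1412 rows in `dfb` happens because `pd.merge` defaults to an inner join: samples that were never drop-tested have no matching `ID` in `test_df` and are excluded. A sketch with hypothetical miniature frames:

```python
import pandas as pd

# Hypothetical stand-ins: only some sample IDs were drop-tested.
tests = pd.DataFrame({'ID': ['S1', 'S3'], 'Result': [1, 0]})
samples = pd.DataFrame({'ID': ['S1', 'S2', 'S3'], 'x1': [1.0, 2.0, 3.0]})

inner = pd.merge(tests, samples, on='ID')             # default how='inner'
left = pd.merge(samples, tests, on='ID', how='left')  # keeps untested rows
```

A left join would keep the untested samples with `Result` as NaN; the inner join is appropriate here because the questions concern only tested cases.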
You must now examine the merged dfb object and answer the following:
- How often each Result occurs for each value of machine_id.
- How often each Result occurs for each value of Supplier.
- How often each Result occurs per Batch for each value of machine_id.
- How often each Result occurs per Batch for each value of machine_id and Supplier.
- The PROPORTION of FAILED tests per Batch per machine_id.
- The PROPORTION of FAILED tests per Batch per machine_id and for each unique value of Supplier.

HINT: Remember that a FAILED test is encoded as Result == 1. How can you calculate the PROPORTION of times Result == 1?
Add as many cells as you see fit to answer this question.
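Following the HINT: because a failure is encoded as `Result == 1`, the proportion of failures in any group is simply the mean of `Result`. A minimal sketch with hypothetical miniature results:

```python
import pandas as pd

# Hypothetical miniature results: the mean of a 0/1 column IS the
# proportion of 1s, so a grouped mean gives the failure rate per group.
t = pd.DataFrame({'machine_id': [1, 1, 1, 2, 2],
                  'Result':     [1, 0, 1, 0, 0]})

fail_rate = t.groupby('machine_id')['Result'].mean()
```

The same one-liner extends to any grouping, e.g. `groupby(['Batch', 'machine_id'])`.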
dfb.head()
| ID | test_group_id | Result | Batch | Supplier | Density | s_id | x1 | x2 | x3 | x4 | machine_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B001-M01-S056 | A-0 | 1 | 1 | B | 10.388587 | 56 | 49.215485 | 105.647327 | 21.410531 | 14.154576 | 1 |
| 1 | B001-M01-S002 | A-0 | 1 | 1 | B | 10.388587 | 2 | 46.887333 | 112.266102 | 23.197330 | 13.064709 | 1 |
| 2 | B001-M01-S049 | A-0 | 1 | 1 | B | 10.388587 | 49 | 49.125629 | 112.323042 | 22.874947 | 13.743788 | 1 |
| 3 | B001-M01-S030 | A-0 | 0 | 1 | B | 10.388587 | 30 | 48.306819 | 109.445722 | 22.910858 | 13.839830 | 1 |
| 4 | B001-M01-S048 | A-0 | 1 | 1 | B | 10.388587 | 48 | 50.564504 | 109.184587 | 22.083064 | 13.954942 | 1 |
dfb.shape
(1412, 12)
dfb.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1412 entries, 0 to 1411 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 1412 non-null object 1 test_group_id 1412 non-null object 2 Result 1412 non-null int64 3 Batch 1412 non-null object 4 Supplier 1412 non-null object 5 Density 1412 non-null float64 6 s_id 1412 non-null int64 7 x1 1412 non-null float64 8 x2 1412 non-null float64 9 x3 1412 non-null float64 10 x4 1412 non-null float64 11 machine_id 1412 non-null int64 dtypes: float64(5), int64(3), object(4) memory usage: 143.4+ KB
dfb.nunique()
ID 1412 test_group_id 141 Result 2 Batch 50 Supplier 2 Density 50 s_id 134 x1 1412 x2 1412 x3 1412 x4 1412 machine_id 3 dtype: int64
sns.catplot(data=dfb, x='machine_id', hue='Result', kind='count')
plt.show()
Reading the counts by machine: around 160 failed tests (Result == 1) for machine 1, around 140 for machine 2, and around 120 for machine 3; each machine shows more passing (Result == 0) than failing tests.
sns.catplot(data=dfb, x='Supplier', hue='Result', kind='count')
plt.show()
The two Suppliers show around 140 and around 300 failed tests (Result == 1), respectively.
sns.catplot(data=dfb, x='machine_id', hue='Result', col='Batch', kind='count', col_wrap=4, sharex=False, sharey=False)
plt.show()
sns.catplot(data=dfb, x='Result', hue='machine_id', col='Supplier', row='Batch', kind='count', sharex=False)
plt.show()
dfb.head()
| ID | test_group_id | Result | Batch | Supplier | Density | s_id | x1 | x2 | x3 | x4 | machine_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B001-M01-S056 | A-0 | 1 | 1 | B | 10.388587 | 56 | 49.215485 | 105.647327 | 21.410531 | 14.154576 | 1 |
| 1 | B001-M01-S002 | A-0 | 1 | 1 | B | 10.388587 | 2 | 46.887333 | 112.266102 | 23.197330 | 13.064709 | 1 |
| 2 | B001-M01-S049 | A-0 | 1 | 1 | B | 10.388587 | 49 | 49.125629 | 112.323042 | 22.874947 | 13.743788 | 1 |
| 3 | B001-M01-S030 | A-0 | 0 | 1 | B | 10.388587 | 30 | 48.306819 | 109.445722 | 22.910858 | 13.839830 | 1 |
| 4 | B001-M01-S048 | A-0 | 1 | 1 | B | 10.388587 | 48 | 50.564504 | 109.184587 | 22.083064 | 13.954942 | 1 |
dfb_prop = dfb[['Batch', 'machine_id', 'Result', 'Supplier']].copy()
dfb_prop
| Batch | machine_id | Result | Supplier | |
|---|---|---|---|---|
| 0 | 1 | 1 | 1 | B |
| 1 | 1 | 1 | 1 | B |
| 2 | 1 | 1 | 1 | B |
| 3 | 1 | 1 | 0 | B |
| 4 | 1 | 1 | 1 | B |
| ... | ... | ... | ... | ... |
| 1407 | 49 | 3 | 1 | A |
| 1408 | 49 | 3 | 1 | A |
| 1409 | 49 | 3 | 1 | A |
| 1410 | 49 | 3 | 1 | A |
| 1411 | 49 | 3 | 1 | A |
1412 rows × 4 columns
failures = dfb_prop[dfb_prop['Result'] == 1].groupby(['Batch', 'machine_id'])['Result'].count()
failures
Batch machine_id
1 1 6
2 10
3 8
2 1 3
2 2
..
49 1 9
2 11
3 10
50 1 1
2 1
Name: Result, Length: 107, dtype: int64
total = dfb_prop.groupby(['Batch', 'machine_id'])['Result'].count()
total
Batch machine_id
1 1 7
2 10
3 8
2 1 13
2 11
..
49 1 9
2 11
3 10
50 1 10
2 9
Name: Result, Length: 141, dtype: int64
proportion_failures = failures / total
proportion_failures = proportion_failures.reset_index()
proportion_failures
| Batch | machine_id | Result | |
|---|---|---|---|
| 0 | 1 | 1 | 0.857143 |
| 1 | 1 | 2 | 1.000000 |
| 2 | 1 | 3 | 1.000000 |
| 3 | 2 | 1 | 0.230769 |
| 4 | 2 | 2 | 0.181818 |
| ... | ... | ... | ... |
| 136 | 49 | 1 | 1.000000 |
| 137 | 49 | 2 | 1.000000 |
| 138 | 49 | 3 | 1.000000 |
| 139 | 50 | 1 | 0.100000 |
| 140 | 50 | 2 | 0.111111 |
141 rows × 3 columns
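One subtlety in the `failures / total` approach: `failures` has only 107 groups while `total` has 141, so index alignment leaves NaN (rather than 0) for the (Batch, machine_id) groups with zero failures. A grouped mean of `Result` sidesteps this. A sketch with hypothetical miniature data:

```python
import pandas as pd

# Hypothetical miniature data: Batch 2 has no failures at all.
t = pd.DataFrame({'Batch': [1, 1, 2, 2], 'Result': [1, 0, 0, 0]})

failures = t[t.Result == 1].groupby('Batch')['Result'].count()
total = t.groupby('Batch')['Result'].count()
ratio = failures / total                     # Batch 2 aligns to NaN
rate = t.groupby('Batch')['Result'].mean()   # Batch 2 yields 0.0
```

NaN groups are silently dropped from bar plots, so the mean-based version gives a more complete picture.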
failures1 = dfb[dfb['Result'] == 1].groupby(['Batch', 'machine_id', 'Supplier'])['Result'].count()
total1 = dfb.groupby(['Batch', 'machine_id', 'Supplier'])['Result'].count()
proportion_failures1 = failures1 / total1
proportion_failures1 = proportion_failures1.reset_index()
sns.catplot(x='Batch', y='Result', hue='Supplier', col='machine_id',\
data=proportion_failures1, kind='bar', aspect=3, col_wrap=1, sharex=False)
plt.show()
You must cluster the rows of dfb using the 4 operational variables x1 through x4. You must decide how many clusters to use and describe how you made that choice. You may use KMeans OR Hierarchical clustering. Include any figures that helped you make that choice.
Visualize your cluster analysis results by:
You are interested in the PROPORTION of cell phone cases that failed the DROP TEST. Are any of the clusters associated with higher failure PROPORTIONS than others? Based on your visualizations how would you describe that cluster?
Add as many cells as you see fit to answer this question.
dfb.head()
| ID | test_group_id | Result | Batch | Supplier | Density | s_id | x1 | x2 | x3 | x4 | machine_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B001-M01-S056 | A-0 | 1 | 1 | B | 10.388587 | 56 | 49.215485 | 105.647327 | 21.410531 | 14.154576 | 1 |
| 1 | B001-M01-S002 | A-0 | 1 | 1 | B | 10.388587 | 2 | 46.887333 | 112.266102 | 23.197330 | 13.064709 | 1 |
| 2 | B001-M01-S049 | A-0 | 1 | 1 | B | 10.388587 | 49 | 49.125629 | 112.323042 | 22.874947 | 13.743788 | 1 |
| 3 | B001-M01-S030 | A-0 | 0 | 1 | B | 10.388587 | 30 | 48.306819 | 109.445722 | 22.910858 | 13.839830 | 1 |
| 4 | B001-M01-S048 | A-0 | 1 | 1 | B | 10.388587 | 48 | 50.564504 | 109.184587 | 22.083064 | 13.954942 | 1 |
dfb_clean = dfb.copy()
dfb_clean.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1412 entries, 0 to 1411 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 1412 non-null object 1 test_group_id 1412 non-null object 2 Result 1412 non-null int64 3 Batch 1412 non-null object 4 Supplier 1412 non-null object 5 Density 1412 non-null float64 6 s_id 1412 non-null int64 7 x1 1412 non-null float64 8 x2 1412 non-null float64 9 x3 1412 non-null float64 10 x4 1412 non-null float64 11 machine_id 1412 non-null int64 dtypes: float64(5), int64(3), object(4) memory usage: 143.4+ KB
dfb_features_clean = dfb_clean[['x1', 'x2', 'x3', 'x4', 'Result']].copy()
dfb_features_clean
| x1 | x2 | x3 | x4 | Result | |
|---|---|---|---|---|---|
| 0 | 49.215485 | 105.647327 | 21.410531 | 14.154576 | 1 |
| 1 | 46.887333 | 112.266102 | 23.197330 | 13.064709 | 1 |
| 2 | 49.125629 | 112.323042 | 22.874947 | 13.743788 | 1 |
| 3 | 48.306819 | 109.445722 | 22.910858 | 13.839830 | 0 |
| 4 | 50.564504 | 109.184587 | 22.083064 | 13.954942 | 1 |
| ... | ... | ... | ... | ... | ... |
| 1407 | 49.464765 | 102.967447 | 22.411845 | 13.919089 | 1 |
| 1408 | 49.408191 | 103.516814 | 21.966617 | 13.730119 | 1 |
| 1409 | 49.605196 | 103.460366 | 21.932429 | 13.790280 | 1 |
| 1410 | 49.716703 | 104.346466 | 22.059022 | 13.543388 | 1 |
| 1411 | 49.885679 | 104.690007 | 22.055338 | 13.882302 | 1 |
1412 rows × 5 columns
dfb_features_clean['Result'] = dfb_features_clean.Result.astype('category')
dfb_features_clean.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1412 entries, 0 to 1411 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 x1 1412 non-null float64 1 x2 1412 non-null float64 2 x3 1412 non-null float64 3 x4 1412 non-null float64 4 Result 1412 non-null category dtypes: category(1), float64(4) memory usage: 56.7 KB
X = StandardScaler().fit_transform( dfb_features_clean )
X.shape
(1412, 5)
dfb_features_clean.shape
(1412, 5)
sns.catplot(data = pd.DataFrame(X, columns=dfb_features_clean.columns), kind='box', aspect=2)
plt.show()
clusters_2 = KMeans(n_clusters=2, random_state=121, n_init=25, max_iter=500).fit_predict( X )
dfb_features_clean_copy = dfb_features_clean.copy()
dfb_features_clean_copy['k2'] = pd.Series( clusters_2, index=dfb_features_clean_copy.index ).astype('category')
dfb_features_clean_copy.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1412 entries, 0 to 1411 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 x1 1412 non-null float64 1 x2 1412 non-null float64 2 x3 1412 non-null float64 3 x4 1412 non-null float64 4 Result 1412 non-null category 5 k2 1412 non-null category dtypes: category(2), float64(4) memory usage: 58.2 KB
dfb_features_clean_copy.k2.value_counts()
1 752 0 660 Name: k2, dtype: int64
sns.pairplot(data = dfb_features_clean_copy, hue='k2', diag_kws={'common_norm': False})
plt.show()
fig, ax = plt.subplots()
sns.heatmap(data = pd.crosstab( dfb_features_clean_copy.Result, dfb_features_clean_copy.k2, margins=True ),
annot=True, annot_kws={"fontsize": 20}, fmt='g',
cbar=False,
ax=ax)
plt.show()
sns.relplot(data = dfb_features_clean_copy, x='x1', y='x2', hue='k2', style='Result')
plt.show()
sns.relplot(data = dfb_features_clean_copy, x='x1', y='x3', hue='k2', style='Result')
plt.show()
sns.relplot(data = dfb_features_clean_copy, x='x1', y='x4', hue='k2', style='Result')
plt.show()
sns.relplot(data = dfb_features_clean_copy, x='x3', y='x2', hue='k2', style='Result')
plt.show()
sns.relplot(data = dfb_features_clean_copy, x='x4', y='x2', hue='k2', style='Result')
plt.show()
sns.relplot(data = dfb_features_clean_copy, x='x3', y='x4', hue='k2', style='Result')
plt.show()
tots_within = []
K = range(1, 31)
for k in K:
    km = KMeans(n_clusters=k, random_state=121, n_init=25, max_iter=500)
    km = km.fit(X)
    tots_within.append(km.inertia_)
fig, ax = plt.subplots()
ax.plot( K, tots_within, 'bo-' )
ax.set_xlabel('number of clusters')
ax.set_ylabel('total within sum of squares')
plt.show()
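The elbow plot works because inertia (total within-cluster sum of squares) can only decrease as k grows; the "elbow" is read from where the decrease flattens. A small synthetic sketch of the same loop:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic points: inertia at k=1 is the total sum of squares, and
# best-of-n_init restarts make it (essentially) non-increasing in k.
rng = np.random.default_rng(0)
pts = rng.normal(size=(60, 2))

inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(pts).inertia_
            for k in (1, 2, 3, 4)]
```

Because inertia always falls with k, the choice of k comes from the shape of the curve, not its minimum.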
clusters_4 = KMeans(n_clusters=4, random_state=121, n_init=25, max_iter=500).fit_predict( X )
dfb_features_clean_copy['k4'] = pd.Series( clusters_4, index=dfb_features_clean_copy.index ).astype('category')
sns.pairplot(data = dfb_features_clean_copy, hue='k4', diag_kws={'common_norm': False})
plt.show()
clusters_5 = KMeans(n_clusters=5, random_state=121, n_init=25, max_iter=500).fit_predict( X )
dfb_features_clean_copy['k5'] = pd.Series( clusters_5, index=dfb_features_clean_copy.index ).astype('category')
sns.pairplot(data = dfb_features_clean_copy, hue='k5', diag_kws={'common_norm': False})
plt.show()
from scipy.cluster import hierarchy
hclust_complete = hierarchy.complete(X)
type(hclust_complete)
numpy.ndarray
fig = plt.figure(figsize=(12,6))
dn = hierarchy.dendrogram( hclust_complete, no_labels=True )
plt.show()
hierarchy.cut_tree( hclust_complete, n_clusters=4 ).ravel()
array([0, 0, 0, ..., 0, 0, 0])
np.unique( hierarchy.cut_tree( hclust_complete, n_clusters=4).ravel() )
array([0, 1, 2, 3])
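To see what `hierarchy.complete` plus `cut_tree` are doing, here is a tiny sketch on two well-separated synthetic blobs: cutting the complete-linkage tree at two clusters should put each blob in its own cluster.

```python
import numpy as np
from scipy.cluster import hierarchy

# Two tight, well-separated pairs of points.
pts = np.array([[0.0, 0.0], [0.1, 0.0],
                [10.0, 10.0], [10.1, 10.0]])

# complete() builds the complete-linkage tree; cut_tree() slices it
# at a requested number of clusters, returning one label per row.
Z = hierarchy.complete(pts)
labels = hierarchy.cut_tree(Z, n_clusters=2).ravel()
```

The same `cut_tree` call with a different `n_clusters` re-slices the same tree without refitting, which is the main practical advantage over KMeans here.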
pca_dfb = PCA(n_components=4).fit_transform( X )
pca_dfb.shape
(1412, 4)
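When PCA keeps as many components as there are input columns (4 here), the components merely rotate the data: the explained-variance ratios are sorted in decreasing order and sum to 1. A small synthetic check:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 4-column data; with n_components equal to the number of
# columns, the components together account for all of the variance.
rng = np.random.default_rng(1)
x = rng.normal(size=(100, 4))

pca = PCA(n_components=4).fit(x)
ratios = pca.explained_variance_ratio_
```

In practice the first two components are usually enough for the 2-D cluster visualizations used below.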
pca_dfb_df = pd.DataFrame( pca_dfb, columns=['pc01', 'pc02', 'pc03', 'pc04'] )
pca_dfb_df
| pc01 | pc02 | pc03 | pc04 | |
|---|---|---|---|---|
| 0 | -2.463565 | -1.786034 | 0.438785 | 0.543170 |
| 1 | -2.054846 | -1.056957 | 1.012786 | 0.774621 |
| 2 | -1.907154 | -1.650158 | 0.967256 | 0.776832 |
| 3 | -1.170109 | -1.032959 | -0.669684 | 1.902278 |
| 4 | -1.946382 | -1.927281 | 0.712119 | 0.517157 |
| ... | ... | ... | ... | ... |
| 1407 | -2.279358 | -1.420362 | 0.691389 | 0.606428 |
| 1408 | -2.289154 | -1.462599 | 0.565129 | 0.440833 |
| 1409 | -2.279220 | -1.513296 | 0.558445 | 0.439195 |
| 1410 | -2.139895 | -1.460727 | 0.607652 | 0.361835 |
| 1411 | -2.196259 | -1.613232 | 0.620250 | 0.492600 |
1412 rows × 4 columns
pca_dfb_df['hclust_4'] = pd.Series( hierarchy.cut_tree( hclust_complete, n_clusters=4 ).ravel(),
index=pca_dfb_df.index ).astype('category')
pca_dfb_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1412 entries, 0 to 1411 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 pc01 1412 non-null float64 1 pc02 1412 non-null float64 2 pc03 1412 non-null float64 3 pc04 1412 non-null float64 4 hclust_4 1412 non-null category dtypes: category(1), float64(4) memory usage: 45.8 KB
pca_dfb_df.hclust_4.value_counts()
3 666 1 371 0 201 2 174 Name: hclust_4, dtype: int64
sns.catplot(data = pca_dfb_df, x='hclust_4', kind='count')
plt.show()
sns.pairplot(data=pca_dfb_df, hue='hclust_4')
plt.show()
dfb_features_clean_copy.head()
| x1 | x2 | x3 | x4 | Result | k2 | k4 | k5 | |
|---|---|---|---|---|---|---|---|---|
| 0 | 49.215485 | 105.647327 | 21.410531 | 14.154576 | 1 | 1 | 3 | 4 |
| 1 | 46.887333 | 112.266102 | 23.197330 | 13.064709 | 1 | 1 | 3 | 4 |
| 2 | 49.125629 | 112.323042 | 22.874947 | 13.743788 | 1 | 1 | 3 | 4 |
| 3 | 48.306819 | 109.445722 | 22.910858 | 13.839830 | 0 | 1 | 0 | 4 |
| 4 | 50.564504 | 109.184587 | 22.083064 | 13.954942 | 1 | 1 | 3 | 4 |
dfb_failed_tests = dfb_features_clean_copy[dfb_features_clean_copy['Result'] == 1].copy()
dfb_failed_tests
| x1 | x2 | x3 | x4 | Result | k2 | k4 | k5 | |
|---|---|---|---|---|---|---|---|---|
| 0 | 49.215485 | 105.647327 | 21.410531 | 14.154576 | 1 | 1 | 3 | 4 |
| 1 | 46.887333 | 112.266102 | 23.197330 | 13.064709 | 1 | 1 | 3 | 4 |
| 2 | 49.125629 | 112.323042 | 22.874947 | 13.743788 | 1 | 1 | 3 | 4 |
| 4 | 50.564504 | 109.184587 | 22.083064 | 13.954942 | 1 | 1 | 3 | 4 |
| 5 | 52.058573 | 93.272568 | 21.937216 | 13.332882 | 1 | 1 | 3 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1407 | 49.464765 | 102.967447 | 22.411845 | 13.919089 | 1 | 1 | 3 | 4 |
| 1408 | 49.408191 | 103.516814 | 21.966617 | 13.730119 | 1 | 1 | 3 | 4 |
| 1409 | 49.605196 | 103.460366 | 21.932429 | 13.790280 | 1 | 1 | 3 | 4 |
| 1410 | 49.716703 | 104.346466 | 22.059022 | 13.543388 | 1 | 1 | 3 | 4 |
| 1411 | 49.885679 | 104.690007 | 22.055338 | 13.882302 | 1 | 1 | 3 | 4 |
426 rows × 8 columns
dfb_failed_tests_k4 = dfb_failed_tests.groupby(['k4']).size().reset_index(name='Cluster Count').copy()
dfb_failed_tests_k4
| k4 | Cluster Count | |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 70 |
| 2 | 2 | 59 |
| 3 | 3 | 297 |
failed2 = dfb_failed_tests_k4['Cluster Count']
failed2
0 0 1 70 2 59 3 297 Name: Cluster Count, dtype: int64
sum_of_cluster = dfb_failed_tests_k4['Cluster Count'].sum()
sum_of_cluster
426
dfb_failed_tests_k4['Proportion'] = failed2 / sum_of_cluster
dfb_failed_tests_k4
| k4 | Cluster Count | Proportion | |
|---|---|---|---|
| 0 | 0 | 0 | 0.000000 |
| 1 | 1 | 70 | 0.164319 |
| 2 | 2 | 59 | 0.138498 |
| 3 | 3 | 297 | 0.697183 |
sns.barplot(x="k4", y="Proportion", data=dfb_failed_tests_k4)
plt.show()
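Note that the table above reports each cluster's share of all failures, which also depends on cluster size. A complementary view (a sketch with hypothetical miniature labels, not the exam data) is the failure rate within each cluster:

```python
import pandas as pd

# Hypothetical miniature labels mirroring the k4 / Result columns.
t = pd.DataFrame({'k4':     [0, 0, 0, 1, 1],
                  'Result': [1, 1, 0, 1, 0]})

# Share of ALL failures captured by each cluster (what the table shows)...
share = t[t.Result == 1].groupby('k4').size() / (t.Result == 1).sum()

# ...versus the failure rate WITHIN each cluster (mean of a 0/1 column).
within = t.groupby('k4')['Result'].mean()
```

A large cluster can dominate the share of failures even if its within-cluster failure rate is modest, so looking at both guards against that confound.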